(57) Abstract: In order to alleviate problems caused by delivery of unwanted or unsolicited email (spam), email traffic is analysed
for patterns of traffic which indicate or suggest that the emails are spam. When the system detects such a pattern, it can
take remedial action, e.g. blocking delivery of the emails involved, either itself or by alerting a human operator. Analysis of email takes
place by scanning a database of data abstracted from emails. These data are primarily abstracted from the emails when regarded as
"containers" (i.e. without reference to the message contents).
Published: without international search report and to be republished upon receipt of that report

For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and Abbreviations" appearing at the
beginning of each regular issue of the PCT Gazette.

A METHOD OF, AND SYSTEM FOR, PROCESSING EMAIL
IN PARTICULAR TO DETECT UNSOLICITED BULK EMAIL

The present invention relates to a method of, and system for, processing email
[...] unwanted or unsolicited commercial email (UCE) and mail bombs.

A typical UCE or UBE consists of tens, hundreds, thousands or more copies
[...] Due to the nature of the task, these emails are [...] spam. The enjoyment and
usefulness of email is [...] by the increasing amount of spam [...]

[...] an ISP (or end user) may use software that implements "spam filters". These
[...] at a certain destination, and block delivery of them once a threshold is reached.

In our copending British Patent Application No. 0016835.1, filed [...] 2000, we
propose a system for looking for, and acting upon, traffic patterns that indicate,
or suggest, the transmission of a virus by email. The present invention relates to the
application of that technique to the identification of spam including UBE, UCE and mail
bombs.

According to the present invention there is provided a method of processing
email [...] and, once such a pattern is detected, [...] automatic [...] remedial [...]

This system thus provides a way of identifying and stopping such unwanted
[...] level. However, it can also be scaled down to a [...] ISP level or even a [...]

[...] of traffic in [...] Internet. This expression is equally applicable to the present invention.

In the present invention, each email is analysed primarily at the container
level and, if likely to be spam, logged. If similar emails are detected, then the system
[...] the "likely-to-be-spam" score and the number of emails received. Thus, some spam may be
[...] at the first email. Others may [...] 10s or 100s. The system can be tuned so that
[...] spammers.

[...] by way of non-limitative example
with reference to the accompanying drawings, in which:-

Figure 1 illustrates the process of sending an email over the Internet; and
[...] and UUCP. There are also other ways [...] an ISDN or leased line connection instead of [...]

Suppose a user 1A [...] source.com [...] these domains are maintained by [...]
(Internet Service Providers) [...] messages and one or more POP3 servers
4A, 4B for inbound ones. These [...] addresses it to "arecipient@adestination.com".
[...] transferring the email 10 to the server 3A [...]

4. The SMTP server 3A parses [...] For present purposes [...]
that the sender's and recipient's ISPs are [...] simply route the email through to its
associated POP3 server 4A for subsequent collection.

5. The SMTP server 3A locates an Internet Domain Name server and
obtains an IP address for the destination [...] SMTP server 3B at [...]

6. The SMTP server 3A connects [...] "adestination.com" via SMTP and sends it
the sender and recipient addresses and message body similarly to Step 3.

7. The SMTP server 3B recognises that the domain name refers to itself,
[...] for collection by the recipient's email client 1B.

[...] to assess whether they are candidates for logging;

A searcher 24, which scans new entries in the database 23 searching for [...]

[...]

A bounce analyser 28 (optional), which logs mail that bounces to the [...]

The message decomposer/analyser 21 decomposes emails into their
[...] and analyses them to assess whether they are candidates for logging.

[...] feedback from the stopper 25.

[...] spam. The following is a non-exhaustive list of criteria which [...]
in order to implement these heuristics. Other criteria may be [...]

1. It is addressed to many recipients.

The addresses can be determined by parsing fields, such as To, Cc and Bcc 
in the email header and by analysing the email envelope. The number of addresses can 
simply be counted. 
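
As an illustration of the counting step, the following is a minimal sketch using Python's standard `email` package. The function name `count_recipients` and the `envelope_rcpts` parameter (standing in for the addresses taken from the email envelope) are assumptions made for this example and do not appear in the original text.

```python
from email import message_from_string
from email.utils import getaddresses


def count_recipients(raw_email, envelope_rcpts=()):
    """Count distinct recipient addresses found in To/Cc/Bcc and in the envelope."""
    msg = message_from_string(raw_email)
    header_values = []
    for field in ("To", "Cc", "Bcc"):
        # get_all returns every occurrence of the field, or the default if absent
        header_values.extend(msg.get_all(field, []))
    addresses = {addr.lower() for _name, addr in getaddresses(header_values) if addr}
    # envelope_rcpts is a hypothetical list of envelope recipient addresses
    addresses.update(a.lower() for a in envelope_rcpts)
    return len(addresses)
```

The resulting count could then be compared against whatever threshold an operator considers "many"; the source does not specify one.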

2. It is addressed to recipients or organisations in a) alphabetical or 
b) reverse alphabetical order. 

Once the addresses have been extracted as per Item 1 above, it is a simple 
matter to determine whether they are in any of these orders. Any ordering suggests that 
the addressee list was derived from a mailing list, possibly of the sort commonly used to
generate bulk emails. 
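
The ordering test for this criterion can be sketched as follows. The function name `address_ordering` and the minimum of three addresses are assumptions made for the example, not values taken from the source.

```python
def address_ordering(addresses):
    """Return 'ascending', 'descending', or None for a list of recipient addresses."""
    keys = [a.strip().lower() for a in addresses]
    if len(keys) < 3:
        # Assumed cut-off: with fewer addresses any ordering is likely coincidental
        return None
    if keys == sorted(keys):
        return "ascending"
    if keys == sorted(keys, reverse=True):
        return "descending"
    return None
```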

3. It contains structural quirks.

Most emails are generated by tried and tested applications. These
applications will always generate email in a particular way. It is often possible to identify
which application generated a particular email by examining the email headers and also by
examining the format of the different parts. It is then possible to identify emails which
contain quirks which either indicate that the email is attempting to look as if it was
generated by a known mailer, but was not, or that it was generated by a new and unknown
mailer, or by an application (which could be a virus or worm). All are suspicious.

Examples:

Inconsistent capitalisation

from: alex@star.co.uk
To: alex@star.co.uk

The from and to have different capitalisation.

Non-standard ordering of header elements

Subject: Tower fault tolerance
Content-type: multipart/mixed; boundary="======_962609498==_"
Mime-Version: 1.0

The Mime-Version header normally comes before the Content-Type header.
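
The two header examples above could be checked along the following lines. This is only a sketch: the function name `header_quirks` and the quirk labels are hypothetical, and a real implementation would need a profile of each known mailer rather than these two fixed rules.

```python
from email import message_from_string


def header_quirks(raw_email):
    """Return labels for two simple structural quirks in the header block."""
    # keys() preserves the original order and capitalisation of header names
    names = message_from_string(raw_email).keys()
    quirks = []

    # Quirk 1: some header names start upper-case and others lower-case
    initial_cases = {name[0].isupper() for name in names if name[0].isalpha()}
    if len(initial_cases) > 1:
        quirks.append("inconsistent-capitalisation")

    # Quirk 2: Content-Type appears before Mime-Version
    lowered = [name.lower() for name in names]
    if "mime-version" in lowered and "content-type" in lowered:
        if lowered.index("content-type") < lowered.index("mime-version"):
            quirks.append("content-type-before-mime-version")
    return quirks
```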



PCT/GB02/00926 

WO 02/071286 -6- 



Missing or additional header elements

X-Mailer: [...] Version 3.0.5 (32)
Date: Mon, 03 Jul 2000 12:24:17 +0100

Eudora normally also includes an X-Sender header.

4. It contains unusual message headers.

This would include headers that are rarely or never generated by normal [...]

5. It originates from particular IP addresses or IP address ranges.

The IP address of the originator is, of course, known and hence can be used
to determine whether this criterion is met.

6. It contains specialised constructs.

Some email uses HTML script to encrypt the message content. This [...]
normal email. [...] web pages to [...] whether the email
has been read. It would be unusual for a normal email to do this.

7. The text body is susceptible to particular linguistic analysis.

Once the text body has been parsed out of the email it can be analysed and
scored in a variety of ways, for example:

- analysis by reference to established stylistic and
content metrics, for example Gunning's Fog Index or Fry's
Readability Graph. Analysis can establish whether the style
indicates that it originated in the scientific community, the
civil services, etc.

- analysis to determine whether the message body
contains certain keywords or keyphrases.
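
Both kinds of analysis can be sketched briefly. The Fog Index below follows the usual formula, 0.4 × (words per sentence + 100 × complex words / words), but the syllable counter is a crude vowel-group approximation, and all function names are assumptions introduced for this example.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: the number of vowel groups (an approximation)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text):
    """Approximate Gunning Fog Index of a text body."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are taken here to be those with three or more syllables
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


def keyword_hits(text, keywords):
    """Return the keywords or keyphrases that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [k for k in keywords if k.lower() in lowered]
```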
8. Empty message sender envelopes.

An email normally indicates the originator in the Sender text field and spam
originators will often put a bogus entry in that field to disguise the fact that the email is
[...] However, the Sender identity is also supposed to be specified in the protocol under
which SMTP processes talk to one another in the transfer of email, and this criterion is
concerned with the absence of the sender identification from the relevant protocol slot,
namely the Mail From protocol slot.

9. Invalid message sender email addresses.

This is complementary to item 8 and involves consideration of both the
sender field of the message and the sender protocol slot as to whether it is invalid. The
[...] may come from a domain which does not exist or does not follow the normal rules
[...] HotMail addresses cannot be all numbers.
A number of fields of the email may be examined for invalid entries,
including "Sender", "From", and "Errors-to".

10. Message sender addresses which do not match the mail server
from which the mail is sent.

The local mail server knows, or at least can find out from the protocol, the
address of the email sender, and so a determination can be made of whether this matches the
sender address in the mail text.
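
One way to picture this comparison is sketched below. It assumes that the protocol-level sender address is the value given in the SMTP Mail From slot and that the "sender address in the mail text" is the From header; both are interpretations for the purpose of the example, and the name `sender_mismatch` is hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def sender_mismatch(raw_email, envelope_from):
    """True if the From header address differs from the protocol-level sender."""
    header_value = message_from_string(raw_email).get("From", "")
    header_from = parseaddr(header_value)[1]
    # envelope_from is assumed to be the Mail From value, possibly in angle brackets
    return header_from.lower() != envelope_from.strip("<>").lower()
```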

11. Message has a particular container format.

An email has a specific number of attachments (currently spam usually has
[...] their likelihood of indicating spam. Other similar characteristics which can be assessed
include:

the "message boundary" which the email specifies in
the header as a delimiter of subsequent fields of the message.

the "message ID" which is supposed to be a text string
which uniquely identifies a particular instance of an email.
Bulk mail may contain the same message ID in some or all
email instances.
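
A repeated message ID across supposedly distinct emails could be detected as in the following sketch. The function name `repeated_message_ids` is an assumption, and the example reads the ID from the standard Message-ID header.

```python
from collections import Counter
from email import message_from_string


def repeated_message_ids(raw_emails):
    """Map each Message-ID seen more than once to the number of times it occurred."""
    ids = []
    for raw in raw_emails:
        message_id = message_from_string(raw).get("Message-ID")
        if message_id:
            ids.append(message_id.strip())
    return {mid: n for mid, n in Counter(ids).items() if n > 1}
```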

Each of the above criteria is assigned a numerical score, and an algorithm is
used by analyser 21 to determine whether this mail is a candidate for logging. This
[...] will need to evolve over time to track changes in spamming patterns. The
[...] database and improves performance. However, this step is
not a requirement. The system will work perfectly well if all emails are logged. A
simplistic algorithm [...]:

[...], do not log (spam mail does not [...]).

[...] it was generated by a common mail client such as
Outlook or Eudora, do not log (spam mail is generally generated by a specialist package).
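
The scoring step can be pictured with a small sketch. The criterion names, the score values and the threshold below are all hypothetical; the source states only that each criterion has a numerical score and that an algorithm decides whether the mail is a candidate for logging.

```python
# Hypothetical per-criterion scores; the source gives no actual values.
CRITERION_SCORES = {
    "many_recipients": 3,
    "alphabetical_recipients": 2,
    "structural_quirks": 4,
    "empty_envelope_sender": 3,
    # Simplistic rule from the text: mail from a common client is unlikely to be spam
    "common_mail_client": -5,
}
LOG_THRESHOLD = 5  # assumed cut-off


def spam_score(criteria_met):
    """Sum the scores of the criteria an email meets; unknown names score zero."""
    return sum(CRITERION_SCORES.get(name, 0) for name in criteria_met)


def is_candidate_for_logging(criteria_met, threshold=LOG_THRESHOLD):
    """Decide whether the email should be passed on for logging."""
    return spam_score(criteria_met) >= threshold
```

Keeping the scores in a table rather than in code makes it easier to adjust them as spamming patterns change, which the text says will be necessary.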

Each UCE/mail bomb package will construct the emails in a certain way,
and by analysing the message container it is possible to identify the mail as being
[...] release versions of the generator package.

[...] the email or similar emails if they recur. The values may include, but are not limited to:

The subject line, digest of subject line, digest of partial subject line
Digest of text, digest of first, middle and last part of text
Sender
Originating IP address
Path mail has taken
Structural format indicators
Structural quirk indicators

The digests may be of MD5 type, i.e. text strings derived using a one way
hashing function from the field in question.
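
A minimal sketch of such digests, using Python's standard `hashlib`, is given below. Splitting the text into equal thirds is an assumption made for the example; the source does not say how the first, middle and last parts are delimited.

```python
import hashlib


def md5_digest(text):
    """Hex MD5 digest of a text field (UTF-8 encoding is assumed)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def text_digests(body):
    """Digests of the whole body and of its first, middle and last thirds."""
    third = max(1, len(body) // 3)
    return {
        "whole": md5_digest(body),
        "first": md5_digest(body[:third]),
        "middle": md5_digest(body[third:2 * third]),
        "last": md5_digest(body[2 * third:]),
    }
```

Partial digests allow two emails to be matched even when a bulk mailer varies only part of the text, for instance a personalised greeting.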

The logger 22 will log these to the database, together with other factors
which may help future analysis, such as:

Time of logging
Linguistic analysis indicators

[...] As regards multi-tier logging, it is possible to [...]

The searcher 24 [...] components. Depending on the [...]
score, the system may identify a [...]
stopper 25 so that the next message with that [...] potential threat can
[...] and see. The stopper 25 responds appropriately to the [...]

The following criteria can be used [...] the multiple email level:
They contain the same, or similar subject line
They contain the same or similar body text
They are addressed to many recipients
They are addressed to recipients in alphabetical, or reverse alphabetical order
They contain the same structural format
They contain the same structural quirks
They contain the same unusual message headers
They contain specialised constructs
The body text is susceptible to linguistic analysis
Empty message sender envelopes
Invalid message sender email addresses
Message sender addresses which do not match the mail server from which the mail is arriving
Number of bounces of this email, and reason for bounce
They come from the same IP address, but have different sender addresses


The searcher 24 can be configured with different parameters, so that it can 
be more sensitive if searching logs from a single email gateway, and less sensitive if 
processing a database of world-wide information. 
Each criterion can be associated with a different score.

The time between searches can be adjusted.

The time span each search covers can be adjusted and multiple time
spans accommodated.

Overall thresholds can be set.
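
The configurable search can be sketched as a scan over recent log entries. Everything named here is an assumption for illustration: the log is modelled as `(timestamp, digest, score)` tuples, and a pattern is reported when the accumulated score of emails sharing a digest within the time span reaches the overall threshold.

```python
from collections import defaultdict


def find_patterns(log_entries, now, time_span, threshold):
    """Return digests whose accumulated score within the time span meets the threshold.

    log_entries is a hypothetical iterable of (timestamp, digest, score) tuples.
    """
    totals = defaultdict(int)
    for timestamp, digest, score in log_entries:
        if now - timestamp <= time_span:  # only entries inside the search window
            totals[digest] += score
    return sorted(d for d, total in totals.items() if total >= threshold)
```

Because the accumulated score grows with both the per-email score and the number of similar emails, a high-scoring email may cross the threshold at once while a low-scoring one needs many repeats. Running the function with several `time_span` values would accommodate multiple time spans.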
The stopper 25 takes signatures from the searcher 24. The signature
identifies characteristics of emails which must be stopped, or which must be investigated
further. Once a stop signature is received, all future emails matching this signature as detected
by the analyser 21 are stopped. Current queued emails matching this signature are deleted
by the purger. Old stopper signatures are periodically deleted.
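
The behaviour of the stopper with stop signatures might be modelled as below. The class and method names, the use of plain strings as signatures and the `max_age` expiry rule are assumptions made for the sketch; the source does not describe the data structures involved.

```python
class Stopper:
    """Holds stop signatures, purges queued mail and expires old signatures."""

    def __init__(self, max_age):
        self.max_age = max_age   # assumed lifetime of a signature
        self._signatures = {}    # signature -> time it was received

    def add_stop_signature(self, signature, now):
        self._signatures[signature] = now

    def should_stop(self, signature):
        """True if an email with this signature must be stopped."""
        return signature in self._signatures

    def purge_queue(self, queued):
        """Drop queued (signature, message) pairs that match a stop signature."""
        return [(s, m) for s, m in queued if s not in self._signatures]

    def expire_old(self, now):
        """Periodic clean-up: delete signatures older than max_age."""
        self._signatures = {
            s: t for s, t in self._signatures.items() if now - t <= self.max_age
        }
```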
On receiving an investigation signature, the next email that matches this
signature is investigated more fully, and the signature then discarded. Depending on the
time needed, this investigation need not interrupt the flow of mail - the mail in question
can be copied and analysed either by a separate process on the mail server, or even on
another machine. Since many mail servers may receive an email matching the signature at
roughly the same time, the recommended approach is for these machines not to do the
analysis themselves, but to copy the mail to another machine for analysis. This does not
impact the flow of mail, and ensures that analysis work is not duplicated. If analysis work
proves to be time-consuming, it is also recommended that the logger 22 flags that the
particular mail is now under analysis. The stopper 25 can then update all the other mail
servers so that they do not try to analyse the same email. The results of the analysis are
then passed back to the logger 22.

The bounce analyser 28 signals to the logger 22 if an email cannot be
delivered to the next mailserver in the delivering route. Normally, only emails which have
[...] 21 as 'interesting' need be logged. To make the
system more sensitive [...] logged. Only certain non-delivery conditions
[...] is not available, this is not [...]

CLAIMS

1. A method of processing email which comprises monitoring email traffic
passing through one or more nodes of a network for patterns of email traffic which are
indicative of or suggestive of a mailshot of unsolicited or [...]

2. A method according to claim 1 which comprises decomposing each email
[...] one or more of the decomposed constituent parts for [...]
of the decomposed email to a database.

3. A method according to claim 2, wherein data is logged only in respect of
email which, on analysis, meets at least one criterion met by email belonging to such a
mailshot.

4. A method according to claim 1, 2 or 3 and including the step of delivering,
or forwarding for delivery, email not considered to belong to such a mailshot.

5. A method according to claim [...] and including the step of continually
[...] email traffic taken to be indicative of, or suggestive of such a mailshot.

6. A method according to claim 5, wherein the database algorithm examines,
[...] been added less than a predetermined time ago.

7. A method according to any one of the preceding claims wherein the
corrective action includes any or all of the following, in relation to each email which
conforms to the detected pattern:

a) at least temporarily stopping the passage of the emails
b) notifying the intended recipient(s)
c) generating a signal to alert a human operator.


8. A system for processing email which comprises means for monitoring email
[...] for patterns of email traffic which [...] both.

[...]

10. [...] to claim 9 and including means for continually or [...]

11. [...] principally or exclusively, only recently added database entries, i.e. entries which have
been added less than a predetermined time ago.

12. [...] claim 9, 10 or 11, wherein data is logged only in
respect of email which, on analysis, meets at least one criterion met by email belonging to
such a mailshot.

13. [...] claim 9, 10, 11, or 12 and including the step of [...]

14. A system according to any one of claims 8 to 13 wherein the corrective
[...] detected pattern:

b) notifying the intended recipient(s)
c) generating a signal to alert a human operator.